Hassan Mojeed
LinkedIn Profile
mojeed.o.hassan@gmail.com
In this project, I will conduct an in-depth analysis of long-term rental pricing in Canada's booming real estate market. The primary goal is to uncover patterns related to apartment types, costs, and locations. By deciphering the dynamics that influence housing trends nationwide, this analysis aims to offer valuable insights for both tenants and investors.
The data used in this project was scraped from Kijiji, Canada's leading local classifieds platform, which facilitates the buying, selling, and trading of a wide range of items, including cars, real estate, jobs, and services. It operates as a subsidiary of eBay and has a significant presence in Canada, as well as in other countries such as Switzerland, Austria, and Japan.
I will be performing the following activities:
Looping: iteratively navigating through the result pages, and
Extracting vital information such as the apartment type, rental cost, location, and description.
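As a small sketch of the pagination step, the page-by-page URLs can be built up front from the same base URL and page range used in the scraping code below, without making any requests yet:

```python
import numpy as np

# Same base URL and category/location suffix used in the scraper
base_url = "https://www.kijiji.ca/b-apartments-condos/canada/page-"
url_separator = "/c37l0"

# Build the URL for each of the 46 result pages
urls = [base_url + str(n) + url_separator for n in np.arange(1, 47)]
print(len(urls))   # 46
print(urls[0])     # https://www.kijiji.ca/b-apartments-condos/canada/page-1/c37l0
```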
To efficiently parse the HTML content, I will use the BeautifulSoup library.
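To illustrate the parsing pattern on a self-contained example first (the HTML snippet below is an invented stand-in, not real Kijiji markup; Kijiji's own class names are auto-generated and appear in the scraper later), BeautifulSoup locates tags by name and class, then reads their text:

```python
from bs4 import BeautifulSoup

# Invented listing markup standing in for a real Kijiji page
html = """
<ul class="listings">
  <li><h3 class="title">2 Bedroom Condo</h3><div class="price">$2,100</div></li>
  <li><h3 class="title">Bachelor Suite</h3><div class="price">$1,450</div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all("li"):
    # Locate child tags by name and class, then read their text
    title = item.find("h3", {"class": "title"}).text
    price = item.find("div", {"class": "price"}).text
    print(title, price)
```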
The data processing and cleaning procedures involve several techniques. A custom function called extract_text pulls attribute-based text out of the page, while specific HTML tags are used to target the relevant information.
The data is transformed into a Pandas DataFrame after undergoing a cleaning process. The final DataFrame is then exported to a Google Sheet to facilitate further analysis, guaranteeing a consistent and easily accessible format for subsequent stages.
Utilizing Power BI, the exported data is transformed into visually appealing dashboards and impactful visual representations of key metrics. By harnessing the capabilities of Power BI, stakeholders gain a holistic view of rental patterns, facilitating informed decision-making in the ever-evolving real estate market.
# Import necessary libraries
from bs4 import BeautifulSoup  # BeautifulSoup is used for web scraping
import requests  # Used for making HTTP requests
import pandas as pd  # Pandas library for data manipulation and analysis
import numpy as np  # NumPy for the page-number range
from df2gspread import df2gspread as d2g  # Uploads DataFrames to Google Sheets
import os  # File-system paths for the credentials file
import gspread  # Google Sheets API client
from oauth2client.service_account import ServiceAccountCredentials  # Service-account auth
import warnings
warnings.filterwarnings("ignore")  # Suppress non-critical warnings
# Page numbers 1 through 46 of the Kijiji search results
page_numbers = np.arange(1, 47)
# Initialize an empty list to store apartment rental details
apartment_rental_details = []
# Loop through the pages
for page_number in page_numbers:
    # Define the base URL for Kijiji apartment listings
    base_url = "https://www.kijiji.ca/b-apartments-condos/canada/page-"
    # URL suffix for the apartments & condos category and location code
    url_separator = "/c37l0"
    # Construct the full URL for the current page
    url = base_url + str(page_number) + url_separator
    # Make an HTTP GET request to the URL
    response = requests.get(url)
    # Parse the HTML content of the page using BeautifulSoup
    page = BeautifulSoup(response.text, "html.parser")
    # Find the container holding the apartment listings
    # (index 10 selects the matching div that holds the listings)
    listings = page.find_all("div", class_="sc-a1f24150-0 AgbZX")[10]
    # Find the list of individual apartment sections within the container
    apartments = listings.find_all("ul", {"class": "sc-6cfad96a-0 jYHHlY"})[1]

    # Loop through each apartment section
    for apartment in apartments:
        # Helper to extract numeric text, stripping thousands separators
        def extract_cost(tag, attribute):
            try:
                return apartment.find(tag, attribute).text.replace(",", "")
            except AttributeError:
                return None

        # Helper to extract plain text from a matching element
        def extract_text(tag, attribute):
            try:
                return apartment.find(tag, attribute).text
            except AttributeError:
                return None

        # Extract apartment details: type, rental cost, location, and description
        apartment_type = extract_text("h3", {"class": "sc-c54bbc09-0 KhHgs sc-a7f125aa-8 hIkVmb"})
        rental_cost = extract_cost("div", {"class": "sc-d6110703-0 cOYpsS"})
        location = extract_text("div", {"class": "sc-3ae9d016-0 WYiPw"})
        description = extract_text("p", {"class": "sc-c54bbc09-0 loTzYZ sc-a7f125aa-10 jA-dWnw"})

        # Append the apartment details to the list
        apartment_rental_details.append([apartment_type, rental_cost, location, description])
# Create a Pandas DataFrame using the apartment_rental_details list
# Specify column names as "Apartment_type", "Rental_cost", "Location", and "Description"
df = pd.DataFrame(apartment_rental_details, columns=["Apartment_type", "Rental_cost", "Location", "Description"])
df.shape
(1840, 4)
# Keep rows where "Rental_cost" is not "Please Contact", and drop rows with missing values
df = df[df["Rental_cost"] != "Please Contact"].dropna()
# Reset the index, drop duplicate rows, and create a new DataFrame 'data'
df1 = df.reset_index(drop=True)
data = df1.drop_duplicates().reset_index(drop=True)
# Strip the leading dollar sign from the rental cost values
data["Rental_cost"] = data["Rental_cost"].str.strip("$")
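Note that after stripping the dollar sign, Rental_cost is still a string column; for any numeric analysis it would need to be cast. A minimal sketch of that conversion (the DataFrame here uses made-up values, not the scraped data):

```python
import pandas as pd

# Made-up cleaned rental costs, still stored as strings
sample = pd.DataFrame({"Rental_cost": ["2100", "1450.50", "995"]})

# errors="coerce" turns any leftover non-numeric value into NaN
sample["Rental_cost"] = pd.to_numeric(sample["Rental_cost"], errors="coerce")
print(sample["Rental_cost"].mean())
```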
data.isna().sum()
Apartment_type    0
Rental_cost       0
Location          0
Description       0
dtype: int64
data.shape
(1686, 4)
path = os.getcwd()
file = path + '/cool-ship-407420-9aadad67f295.json'
file
'/Users/mj/Projects/Projects/webscrapingProject/cool-ship-407420-9aadad67f295.json'
# Define the OAuth scopes for Google Sheets and Drive access
scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive', 'https://www.googleapis.com/auth/drive.file', 'https://www.googleapis.com/auth/spreadsheets']
creds = ServiceAccountCredentials.from_json_keyfile_name(file, scope)
client = gspread.authorize(creds)
# Create a new Google Sheets file with the name 'Rental Price Perspective'
spreadsheet = client.create('Rental Price Perspective')
# Spreadsheet key of the target Google Sheets file (copied from its URL)
spreadsheet_key = '1ixobCbTWoQAJ-vXYOuu3ruvcULtWtKlhIQXoTQpUk_8'
# Uncomment the following lines if you want to work with the default first sheet
# Get the default first sheet
# sheet = spreadsheet.sheet1
# Update the Google Sheets with the DataFrame
# sheet.update([data.columns.values.tolist()] + data.values.tolist())
# Define the worksheet name and starting cell for the data upload
worksheetName = 'Sheet1'
starting_cell = 'A1'
# Upload the data to Google Sheets using the d2g library
d2g.upload(data, spreadsheet_key, worksheetName, credentials=creds, col_names=True, row_names=False, start_cell=starting_cell, clean=False)
# Print a success message
print('The sheet is uploaded successfully')
The sheet is uploaded successfully